Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Support taxa="plants" in EnsemblGene #153

Merged
merged 13 commits into from
Nov 19, 2024
Merged

Conversation

mossjacob
Copy link
Collaborator

Some use cases, for example adding plants, requires adding a keyword argument specifying the kingdom in EnsemblGene.

@sunnyosun
Copy link
Member

Could you check if this is the correct db?

f"mysql+mysqldb://anonymous:@ensembldb.ensembl.org/{self._organism.core_db}"

I think for plants it's a different db.

@mossjacob
Copy link
Collaborator Author

I think it should be the same database (although it is not working). See https://plants.ensembl.org/info/data/mysql.html

Ensembl Genomes databases from all five divisions are located on the same server

and

The following conventions apply:
core databases - <genus_species>core<eg_version><ensembl_version><assembly_version>

But I can't locate, for example, a database such as

arabidopsis_thaliana_core_57_10

Which should be there as it is a model organism..

@sunnyosun
Copy link
Member

sunnyosun commented Nov 12, 2024

Could you try mysql+mysqldb://anonymous:@ensembldb.ensembl.org/plants/{self._organism.core_db}?

@mossjacob
Copy link
Collaborator Author

also an Unknown database, unfortunately!

@mossjacob
Copy link
Collaborator Author

mossjacob commented Nov 12, 2024

This is particularly odd: loading from mysql+mysqldb://anonymous:@ensembldb.ensembl.org works for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/vertebrates/mysql/

but not for items in

https://ftp.ensemblgenomes.ebi.ac.uk/pub/current/plants/mysql

@mossjacob
Copy link
Collaborator Author

Found the fix. Needed to add the right port for some reason:

mysql+mysqldb://anonymous:@ensembldb.ensembl.org:4157/arabidopsis_thaliana_core_60_113_11

It seems the plant and the vertebrates databases use different ports. I will alter this PR to include a check.

@sunnyosun
Copy link
Member

What about using mysql-eg-publicsql.ebi.ac.uk instead of ensembldb.ensembl.org? It's the first option here: https://plants.ensembl.org/info/data/mysql.html

@sunnyosun sunnyosun changed the title Adds option to specify plants or vertebrates when loading an organism's genes Adds option to specify kindeom=plants in EnsemblGene Nov 12, 2024
@sunnyosun sunnyosun changed the title Adds option to specify kindeom=plants in EnsemblGene Support kindeom=plants in EnsemblGene Nov 12, 2024
@sunnyosun sunnyosun changed the title Support kindeom=plants in EnsemblGene ✨ Support kindeom=plants in EnsemblGene Nov 12, 2024
@sunnyosun sunnyosun changed the title ✨ Support kindeom=plants in EnsemblGene ✨ Support kindeom="plants" in EnsemblGene Nov 12, 2024
@mossjacob
Copy link
Collaborator Author

mossjacob commented Nov 12, 2024

@sunnyosun sunnyosun changed the title ✨ Support kindeom="plants" in EnsemblGene ✨ Support kindom="plants" in EnsemblGene Nov 12, 2024
@mossjacob
Copy link
Collaborator Author

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

@Zethson
Copy link
Member

Zethson commented Nov 13, 2024

@mossjacob I'll look into this. I'll report back

@sunnyosun
Copy link
Member

I think this is still incomplete, actually--loading in the downloaded Ensembl plant genes doesn't work because the data frames don't have the ensembl_gene_id column which is set in bionty.Gene._ontology_id_field. Instead, they have stable_id or ncbi_gene_id columns. Unsure if I should create a new Gene model, maybe PlantGene with a different _ontology_id_field or whether you think there is a better way to resolve this?

Ah, this is similar to the yeast case, could you take a look here? https://bionty-assets-gczz.netlify.app/ingest/gene-ensembl-release-112#saccharomyces-cerevisiae

@mossjacob
Copy link
Collaborator Author

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:

gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)

it looks for the field ensembl_gene_id which doesn't exist for these tables, in line 241 _from_values.py of lamindb;

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)

Signed-off-by: zethson <[email protected]>
@sunnyosun sunnyosun changed the title ✨ Support kindom="plants" in EnsemblGene ✨ Support kingdom="plants" in EnsemblGene Nov 13, 2024
@Zethson
Copy link
Member

Zethson commented Nov 14, 2024

Dear @mossjacob,

sorry, I'm still catching up.

  1. Please don't be confused by the failing CI. This is an issue with forked repositories not having access to our secrets. It's on my TODO list to changes.
  2. I pushed a few changes to your PR that refactor it a bit and also fix a few issues with the download of the Gene ontologies. I am surprised that you got it to run with the current code because with the latest versions of the dependencies it didn't work for me anymore.
  3. I created a PR that demonstrates the usage ✨Arabidopsis thaliana release 57 bionty-assets#41 - I guess this is also where you ended up. See: https://6735e98c1a52cf5bf1340a87--bionty-assets-gczz.netlify.app/ingest/plant-gene-ensembl-release-57

After lunch, I will look into your last issue. I'll report back!

Signed-off-by: zethson <[email protected]>
@Zethson Zethson changed the title ✨ Support kingdom="plants" in EnsemblGene ✨ Support taxa="plants" in EnsemblGene Nov 14, 2024
@Zethson
Copy link
Member

Zethson commented Nov 14, 2024

Concerning

Thanks, I took a look at that notebook. I think it is the same as that example, and I get the same output ("no ensembl_gene_id found, writing to table_id column."), but then when I try to run:

gene_source = bt.Source().filter(organism="plants", entity="bionty.Gene").first()
bt.Gene.import_from_source(source=gene_source)

it looks for the field ensembl_gene_id which doesn't exist for these tables, in line 241 _from_values.py of lamindb;

result = public_ontology.inspect(iterable_idx, field=field.field.name, mute=True)

@sunnyosun made me aware that this is a current limitation of from_source that does not support stable_id. I'll make an issue for this. What works for saccharomyces cerevisiae (which is similar to yours as you noted above) is the following:

!lamin init --storage run-tests --schema bionty

import lamindb as ln
import bionty as bt

# The instance is empty. Therefore, we add saccharomyces cerevisiae
bt.Organism.from_source(ontology_id="NCBITaxon:559292").save()

# Save all gene records to the instance
genes = bt.Gene.from_values(bt.Gene.public(organism="saccharomyces cerevisiae").df()["stable_id"],
                                    field="stable_id",
                                    organism="saccharomyces cerevisiae"
                                    )
ln.save(genes)

# Look at our new genes
bt.Gene.df()

Does this help you? Edit: Sorry for closing - I fat fingered the wrong button.

@Zethson Zethson closed this Nov 14, 2024
@Zethson Zethson reopened this Nov 14, 2024
@mossjacob
Copy link
Collaborator Author

Hi! Thank you for this!
I will try this out at some point today or tomorrow. For now I was using:

prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id

which I know is not ideal!

@sunnyosun
Copy link
Member

sunnyosun commented Nov 14, 2024

Hi! Thank you for this! I will try this out at some point today or tomorrow. For now I was using:

prev_ontology_id = bt.Gene._ontology_id_field
bt.Gene._ontology_id_field = "stable_id"
bt.Gene.import_from_source(source=gene_source)
bt.Gene._ontology_id_field = prev_ontology_id

which I know is not ideal!

It's super cool that you figured this out even though _ontology_id_field is not user-facing at all!

Then no need to try from_values, we'll make a proper fix for import_from_source!

@mossjacob
Copy link
Collaborator Author

I enjoy a good debug :)
Thanks!

@Zethson
Copy link
Member

Zethson commented Nov 14, 2024

Okay so apparently Pandas 2.2 is not compatible with sqlalchemy 1.4 which I still had on my PC. I reverted the changes now that I made earlier to the SQL statements that fixed that.

I'll make the CI run on this PR soon and then we can consider merging this.

Would you like us to also add some plant organisms genes such as arabidopsis thaliana to Bionty so that it works out of the box for you?

@falexwolf
Copy link
Member

falexwolf commented Nov 14, 2024

sqlalchemy < 2 is no good to use anymore! 😇 😆

@falexwolf
Copy link
Member

Impressively low-level contributions! @mossjacob 😄

@mossjacob
Copy link
Collaborator Author

mossjacob commented Nov 14, 2024

Thanks everyone. Re adding to bionty-assets, while that would be nice, I envisage using quite a few different species so adding all to bionty-assets may be overkill at this point?

In this PR there's a for loop for adding multiple organisms, and I'm also using the code below to download on the fly:

def verify_organism_exists(organism, version="release-57"):
    if bt.Source().filter(organism=organism).count() == 0:
        # Try syncing
        bt.core.sync_all_sources_to_latest()
        if bt.Source().filter(organism=organism).count() == 0:
            # If the source still does not exist, then download it.
            print("Organism does not exist in bionty.")
            print(f"Attempting to download {organism}...")
            ensembl_gene = EnsemblGene(organism=organism, version=version, kingdom="plants")
            print("URL:", ensembl_gene._url)
            df = ensembl_gene.download_df()
            df["description"] = df["description"].str.replace(r"\[.*?\]", "", regex=True)
            filename = f"df_{organism}__ensembl__{version}__Gene.parquet"
            df.to_parquet(filename)
            print(f"Downloaded {organism} to {filename}")
            raise ValueError(f"Add '{filename}' to sources_local.yaml and run bt.core.sync_all_sources_to_latest()")

With the change I made in this other PR, the URL to the local parquet file created can be added to sources_local.yaml and synced.

[edit] the code has been updated

@mossjacob
Copy link
Collaborator Author

Slightly altered the way gene tables are onboarded: the check for the ensembl_gene_id column to consist only of ENS-prefixed IDs is quite strong; for example, for rice (Oryza sativa), some IDs are prefixed by ENS (seems to be mostly RNA) and protein-coding genes are prefixed by "Os". Without this change, all genes are removed from the df in the else clause.

@Zethson
Copy link
Member

Zethson commented Nov 15, 2024

Great @mossjacob! Thank you very much for your enthusiasm and contributions.

  1. I had renamed kingdom to taxa. Is that fine with you or would you prefer kingdom?
  2. There's a couple of follow up issues here that I would like to tackle in the future.
    2.1 Making this less Ensembl focused because the class is even named like that. We can generalize this better. This includes import_source currently only supports ensembl_gene_id but not stable_id or other IDs #160
    2.2 Currently the code is weirdly mixing organism and taxa. We were overloading the organism parameter to handle both but this doesn't really make sense. I would like to decouple that more clearly to get rid of the tiny hack that I introduced in this PR.

I am ready to merge the PR now unless you want to keep building here? We'll also merge your sister PR for local parquet files then.

@mossjacob
Copy link
Collaborator Author

Hi @Zethson , I am also ready for this to be merged in now! I still have to write a test for the local parquet file PR though.
Many thanks

@Zethson Zethson merged commit a0ce2c3 into laminlabs:main Nov 19, 2024
3 checks passed
sunnyosun referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants